[57] ABSTRACT

A method for storing and searching documents also 
useful in disambiguating word senses and a method for 
generating a dictionary of context vectors. The dictio- 
nary of context vectors provides a context vector for 
each word stem in the dictionary. A context vector is a 
fixed length list of component values corresponding to
a list of word-based features, the component values 
being an approximate measure of the conceptual rela- 
tionship between the word stem and the word-based 
feature. Documents are stored by combining the con- 
text vectors of the words remaining in the document 
after uninteresting words are removed. The summary 
vector obtained by adding all of the context vectors of 
the remaining words is normalized. The normalized 
summary vector is stored for each document. The data 
base of normalized summary vectors is searched using a 
query vector and identifying the document whose vec- 
tor is closest to that query vector. The normalized sum- 
mary vectors of each document can be stored using 
cluster trees according to a centroid consistent algo- 
rithm to accelerate the searching process. Said search- 
ing process also gives an efficient way of finding nearest 
neighbor vectors in high-dimensional spaces. 
METHOD FOR DOCUMENT RETRIEVAL AND FOR WORD SENSE DISAMBIGUATION USING NEURAL NETWORKS

BACKGROUND OF THE INVENTION

The present invention is directed to a method for storing documents that permits meaning sensitive and high speed subject area searching and retrieval. The same method may be used for word sense disambiguation (e.g., "star" in the sky vs. movie "star").

The most common method of document storage and retrieval involves storing all documents word for word and then searching for key words in the documents using inverted indexes (1). The key word searches are performed by doing a complete search through all of the contents of the data base that contain a list of query words. Such systems have no knowledge that "car" and "automobile" should be counted as the same term, so the user must include this information by a complex and difficult-to-formulate query. Some systems try to solve this problem by a built-in thesaurus, but such systems lack "meaning sensitivity" and miss many obvious facts, for example, that "car" is closer to "road" than to "hippopotamus." It is an object of the present invention to provide a more meaning sensitive method of storage and retrieval that allows simplified queries and reduces the computing capacity required for any particular data base.

There is currently much research and development in the fields of neural networks (2, 3, 4). A neural network consists of a collection of cells and connections between cells, where every connection has an associated positive or negative number called a component value. Cells employ a common rule to compute a unique output, which is then passed along connections to other cells. The particular connections and component values determine the behavior of the network when some specified "input" cells are initialized to a set of values. The component values play roughly the same role in determining neural network behavior as a program does in determining the behavior of a computer.

Waltz and Pollack (5) presented a neural network based model for word sense disambiguation using high level features which are associated with "micro-features". The system was implemented by running several iterations of spreading activation which would be computationally inefficient for medium- or large-scale systems.

Cottrell (6) constructed a similar system as Waltz and Pollack, with the same practical limitations. Belew (7) has also constructed a document retrieval system based upon a "spreading activation" model, but again this system was impractical for medium or large-scale corpora. McClelland and Kawamoto (2) disclosed a sentence parsing method, including word sense disambiguation, using a model with a small number of orthogonal microfeatures.

An important related problem is the following. Given a collection of high-dimensional vectors (e.g. all vectors might have 200 components), find the closest vector to a newly presented vector. Of course all vectors can simply be searched one-by-one, but this takes much time for a large collection. An object of the current invention is to provide a process which makes such searches using much less work.

Although this problem is easily solved for very low dimensional (e.g., 2-4 dimensions) vectors by K-D trees (8), K-D trees are useless for high dimensional nearest neighbor problems because they take more time than searching vectors one-by-one.

Prior art for document retrieval is well-summarized by reference (1). Salton's SMART system uses variable length lists of terms as a representation, but there is no meaning sensitivity between terms. Any pair of terms are either synonyms or are not synonyms; the closeness of "car" and "driver" is the same as that of "car" and "hippopotamus".

So called "vector space methods" (1) can capture meaning sensitivity, but they require that the closeness of every pair of terms be known. For a typical full-scale system with over 100,000 terms, this would require about 5,000,000,000 relationships, an impractical amount of information to obtain and store. By contrast the present invention requires only one vector per word, or 100,000 vectors for such a typical full-scale system. This is easily stored, and computation of these vectors can be partly automated.

More recently Deerwester et al. (9) have also proposed a method for searching which uses fixed length vectors. However, their method also requires work on the order of at least the square of the sum of the number of documents and the number of terms. This is impractical for large corpora of documents or terms.

Bein and Smolensky (10) have previously proposed a document retrieval model based upon neural networks that captures some meaning sensitivity. However, a [illegible] multiplications for twice the product of the number of documents and the number of keywords for each of a plurality of cycles (they [illegible]). For [illegible] expected to make searches up to 10,000 times faster.

SUMMARY OF THE INVENTION

The present [illegible] for word sense disambiguation.

Document storage according to the present invention is performed by inputting each document in machine readable form into a processing system. Uninteresting words are removed from consideration for the purposes of preparing an easily searchable data base. A context vector assigned to each word remaining in the document is identified from a dictionary of context vectors. A context vector is a fixed length series of component values each representative of a conceptual relationship between a word-based feature and the word to which the vector is assigned. The context vectors are combined for all of the words remaining in the document to obtain a summary vector for that document. The summary vector is normalized so as to produce a normalized summary vector and this normalized summary vector is stored. Thus, the entire document has been reduced to a single normalized summary vector which is used to identify the documents in a data base. Searching for an appropriate document is done through the data base of normalized summary vectors.

In order to further enhance the searching capabilities, a clustering algorithm is used repeatedly for a plurality of levels so as to produce a tree of clustered nodes. A centroid is computed for each node based on the normalized summary vectors assigned to that node by the clustering algorithm. Additional normalized summary vectors are assigned to nodes based on their proximity to the centroids. The bottom level of the tree is a series of buckets each containing the normalized summary vectors as assigned by the clustering algorithm.
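The storage steps summarized above (dropping uninteresting words, looking up context vectors, summing them, and normalizing the sum) can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the three-feature dictionary, its component values, the stop-word set, and the function name `summary_vector` are all hypothetical, and words are assumed to be already reduced to stems.

```python
import math

# Hypothetical three-feature dictionary of context vectors (illustrative values only).
CONTEXT_VECTORS = {
    "car": [2, 0, -1],
    "road": [1, 0, 0],
    "invest": [0, 2, 1],
}
STOP_WORDS = {"a", "an", "the", "or", "for"}  # the "uninteresting" words


def summary_vector(words, dictionary=CONTEXT_VECTORS, weights=None):
    """Sum the context vectors of the remaining words, then normalize to unit length."""
    size = len(next(iter(dictionary.values())))
    total = [0.0] * size
    for word in words:
        if word in STOP_WORDS or word not in dictionary:
            continue  # skip uninteresting or unknown words
        w = 1.0 if weights is None else weights.get(word, 1.0)
        for i, value in enumerate(dictionary[word]):
            total[i] += w * value
    magnitude = math.sqrt(sum(x * x for x in total))
    if magnitude == 0.0:
        return total  # nothing to normalize (assumed handling of an empty document)
    return [x / magnitude for x in total]


if __name__ == "__main__":
    print(summary_vector(["the", "car", "road"]))
```

Because every stored vector has unit length, long and short documents carry equal weight when compared against a query.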

Searching is performed by converting an inquiry into 
a query vector. The query vector is used for identifying 
the desired documents for retrieval. The query vector is 
combined with the normalized summary vectors or 
with the centroids of a node to locate the closest nor- 
malized summary vector or group of closest normalized 
summary vectors. The search is conducted down 
through the tree taking the branch with the closest 
centroid. At the bottom level, each normalized sum- 
mary vector in a bucket is checked to identify the clos- 
est one. A depth first tree walk is continued through the 
entire tree. An entire branch can be eliminated from the 
search if its centroid fails a test based upon the closest 
vector found so far and centroids of other nodes. By 
using the cluster trees, the closest normalized summary 
vector can be identified quickly without needing to 
examine every normalized summary vector in the data
base.

The method of the present invention can also be used
for word sense disambiguation. A series of words sur- 
rounding an ambiguous word in a text are input into a 
processing system in machine readable form. Uninteresting
words are removed and a context vector is located
for each of the remaining words. The context
vectors are combined to obtain a summary vector for 
the series of words. Ambiguous words have a plurality 
of context vectors, one context vector for each of the
meanings of the ambiguous word. The context vector 
closest to the summary vector is used to identify the 
appropriate meaning for the word. 

By storing documents in the form of summary vectors
in accordance with the present invention, searching
for appropriate documents is greatly simplified and 
matches to queries are improved. The cluster tree em- 
ploying centroid consistent clustering gives an efficient 
way of finding nearest neighbor vectors in high-dimensional
spaces. This has application in many schemes
beyond the document searching embodiment described 
herein. 
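As a rough illustration of how centroid consistent clusters speed up a nearest neighbor search, the sketch below uses a single level of buckets rather than a full tree. It assumes Euclidean distance and a pruning test of the form dist(C, C') > 2 * (dist(C', Q) + dist(Q, V)), where C is the centroid of the node being considered, C' a sibling centroid, Q the query and V the closest vector found so far. The function names and the random demo data are hypothetical.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_buckets(vectors, centroids):
    """Assign every vector to its closest centroid (centroid consistency)."""
    buckets = [[] for _ in centroids]
    for v in vectors:
        nearest = min(range(len(centroids)), key=lambda i: dist(v, centroids[i]))
        buckets[nearest].append(v)
    return buckets


def nearest_with_pruning(query, centroids, buckets):
    """Return (closest vector, number of buckets actually scanned)."""
    # Visit buckets in order of centroid closeness, as in a depth first walk.
    order = sorted(range(len(centroids)), key=lambda i: dist(query, centroids[i]))
    best, best_d, scanned = None, math.inf, 0
    for i in order:
        # Prune bucket i if some sibling j proves that any vector closer to the
        # query than the current best would have been assigned to j instead.
        if best is not None and any(
            dist(centroids[i], centroids[j]) > 2 * (dist(centroids[j], query) + best_d)
            for j in range(len(centroids))
            if j != i
        ):
            continue
        scanned += 1
        for v in buckets[i]:
            d = dist(query, v)
            if d < best_d:
                best, best_d = v, d
    return best, scanned


if __name__ == "__main__":
    random.seed(1)
    data = [[random.gauss(0, 1) for _ in range(4)] for _ in range(200)]
    cents = data[:5]  # approximate centroids are enough for this sketch
    found, scanned = nearest_with_pruning([0.1, 0.2, 0.3, 0.4], cents, build_buckets(data, cents))
    print(found, scanned)
```

The test is safe because of the triangle inequality: a vector closer to Q than V lies within dist(C', Q) + dist(Q, V) of C', so when the inequality holds it must be strictly farther from C and cannot sit in C's bucket.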

Other objects and advantages of the present invention 
will become apparent during the following description 
of the presently preferred embodiments of the invention
taken in conjunction with the drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a flow chart of the document storage and 
retrieval system of the present invention.

FIG. 2 is a flow chart of the document storage sub- 
system of FIG. 1. 

FIG. 3 is a flow chart of summary vector creation 
used in the document storage subsystem of FIG. 2. 

FIG. 4 is a flow chart of text preprocessing for use in
the summary vector creation method of FIG. 3. 

FIG. 5 is a flow chart of a text conversion method for 
use in the summary vector creation method of FIG. 3. 

FIG. 6 is a flow chart of cluster tree creation for use 
in the document storage subsystem of FIG. 2.

FIG. 7 is a flow chart of the document retrieval sub- 
system of FIG. 1. 

FIG. 8 is a flow chart of a cluster tree search for use 
in the document retrieval system of FIG. 7. 

FIG. 9 is a flow chart of a search subroutine for use
in the cluster tree search of FIG. 8.

FIG. 10 is a flow chart for word sense disambiguation 
using the summary vector creation method of FIG. 3. 



FIG. 11 is a flow chart for the creation of a dictionary 
of context vectors for use in the present invention. 

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENTS 

The document storage and retrieval and word sense 
disambiguation methods of the present invention are 
based upon a representation scheme using context vec- 
tors. A context vector is a fixed length vector having a 
component value for each of a plurality of word-based 
features. To use the methods of the present invention,
a set of features that are useful for discriminating
words and documents in a particular language and domain
is required. A set of sample features is provided
in Table 1. It is presently recommended that for use in
the present invention, context vectors of between 150 
and 500 features be used. The number and meaning of 
the features will be the same for all of the context vec- 
tors in a dictionary of context vectors for use in the 
present invention. 

TABLE 1

[illegible]  [illegible]  woman      machine    politics
art          science      play       sea        entertainment
walk         lie-down     motion     speak      yell
research     fun          sad        exciting   boring
friend       family       baby       country    hot
cold         hard         soft       sharp      heavy
light        big          small      red        black
white        blue         yellow     animal     mammal
insect       plant        tree       flower     bush
fruit        fragrant     stink      past       present
future       high         low        wood       plastic
paper        metal        building   house      factory
work         early        late       day        night
afternoon    morning      sunny      cloudy     rain
snow         hot          cold       humid      bright
smart        dumb         car        truck      bike
write        type         cook       eat        spicy


A system can be built once the specified features have 
been determined. Each word or word stem in a dictio- 
nary of words needs to have a context vector defined 
for it. A context vector is made up of component values 
each indicative of the conceptual relationship between 
the word defined by the context vector and the speci- 
fied features. For simplicity, the values can be restricted 
to +2, +1, 0, -1, -2. A component is given a positive
value if its feature is strongly associated with the word. 
0 is used if the feature is not associated with the word. 
A negative value is used to indicate that the word con- 
tradicts the feature. As an example, using the features in 
Table 1, the vector for "astronomer" might begin



< +2 +1 +1 -1 -1
   0 +2  0  0  0
   0  0 +1 +1 +1
  +2 +1 -1 +1 -1



Under such a representation, "car" and "automobile" 
are expected to be very similar, "car" and "driver" 
somewhat similar, and "car" and "hippopotamus" uncorrelated.
This is the essence of the word-based meaning
sensitivity of the current invention, and it extends to
document and query representations as discussed be- 
low. It is noted that the interpretation of components of 
context vectors is exactly the same as the interpretation 
of weights in neural networks. 
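To make the graded similarity concrete, the following sketch compares made-up context vectors by their normalized dot product. The four features (machine, motion, animal, human) and every component value are hypothetical, chosen only to reproduce the "very similar / somewhat similar / uncorrelated" pattern described above; the cosine normalization is an assumption of this example.

```python
import math


def dot(a, b):
    """Multiply component values feature by feature and sum the results."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Dot product scaled by both magnitudes, so the result lies in [-1, 1]."""
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


# Hypothetical component values on the features (machine, motion, animal, human),
# restricted to the +2..-2 range mentioned in the text.
VECTORS = {
    "car":          [2, 2, 0, 0],
    "automobile":   [2, 2, 0, 0],
    "driver":       [1, 1, 0, 2],
    "hippopotamus": [-1, 1, 2, 0],
}

if __name__ == "__main__":
    for word in ("automobile", "driver", "hippopotamus"):
        print(word, round(cosine(VECTORS["car"], VECTORS[word]), 2))
```

With these values "automobile" scores highest against "car", "driver" falls in between, and "hippopotamus" scores zero.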
It would be wasteful to create context vectors for all words. Uninteresting words such as a, an, the, or, for, etc. are removed from the dictionary. Furthermore, only the stem of each remaining word is kept. For example, "investments" becomes "invest". Context vectors are thus only defined for the word stems.

It is contemplated that the dictionary of context vectors could be created by hand. Although it is expected that such a task would be time consuming, once the task is completed it need not be completed again. Thus, the brute force method may be used in which for each word in the dictionary of context vectors, a component value is manually selected for each feature in the context vector for that word. This is repeated until the context vectors are finished for each of the words in the dictionary. By limiting the dictionary to word stems, much redundant effort can be avoided. Another possibility is to automate much of the task of context vector creation, as described below.

As an option, context vectors may be lengthened to include random features in addition to the word-based features. For a random feature, the component values for each context vector are generated at random. The use of random features in context vectors will assist in keyword recognition. The more random features that are used, the more sensitive the system is to locating an actual search word. The fewer random features that are used, the more meaning-sensitive the system is. For example, without random features, a search for "car" and a search for "automobile" would have very similar results. But using random features, the two words would have vectors that are distinguishable by the random features and the searches would thus be more sensitive to appearance of the words themselves.

Referring now to the drawings, FIG. 1 illustrates the document storage and retrieval system of the present invention using the context vectors. The system is operated by a computer processing system. Documents 12 are entered into the processing system in machine readable form. The document storage subsystem 14 converts the documents into summary vectors 16 based upon the context vectors of the words in the document. The summary vectors 16 are stored for use in response to search requests. The document storage subsystem can be enhanced by arranging the summary vectors in accordance with a cluster tree. User queries 18 are converted to a vector for use by the retrieval system 20 in identifying responsive documents from the data base. A user query 18 may also be augmented by submitting selected documents 22 which are reduced to a summary vector such that the summary vector is then used as the query vector by the retrieval subsystem to obtain other documents similar to the selected documents.

Referring now to the document storage subsystem of FIG. 2 and the summary vector creation method 19 of FIG. 3, a summary vector is generated for each document 12. A summary vector is a fixed length vector having a length equal to the number of features. This is the same length as the context vectors. The same process is performed for each document 12 in determining its summary vector. A summary vector of the fixed length is initiated 24 with all 0 component values. Summary vector creation may be done for a single document or in the case of a query based on a plurality of texts or words, the summary vector is representative of all of the texts or words used in the query. An index t is set 26 to zero and then incremented 28 to count through all of the texts or words used in a query. If several texts and/or words are being used to form a summary vector, each of the texts and/or words may be given a different weight 30. When a summary vector is being created for a single document the weighting step is irrelevant. The weight of the single document would be made equal to 1 so that this step has no effect on the processing.

In order to eliminate uninteresting words such as common function words and in order to find word stems for the remaining words, the document is preprocessed 32 as shown in FIG. 4. Any common uninteresting words such as a, an, the, or, for, etc. are removed from consideration 34. The remaining words are reduced to their stems by stripping off suffixes 36. For example, investments becomes invest. Any well known algorithm for reducing words to their stems may be used (1).

It may be possible to enhance the accuracy of the searching techniques by using additional processing on the documents. For example, a parsing algorithm can be used to identify the subject, predicate and verb in each sentence. The subject and verb or the subject, verb and predicate can then be assigned 38 a greater weight than the other words in each sentence. Another method is to give the first 100 (or so) words in a document extra weight. Other methods of assigning weights 38 to words in a document may also be used. There are well known algorithms based on the frequency of use of a word in a document or in a series of documents which may be used so as to assign a different weight to each of the remaining words in the document. For example, (1, p. 304) stem s in document d might be weighted by

(tf(d,s)) (log(N/df(s)))

where

tf(d,s) is the number of appearances of stem s in document d;

N is the total number of documents; and

df(s) is the number of documents in which stem s appears.

The preprocessed document may then be converted 40 into vector form as shown in FIG. 5. A summary vector is initialized 42 by setting all component values to 0. Each of the words remaining in the preprocessed text is considered one at a time 44. For each word, its associated context vector is located 46 one at a time in a dictionary of context vectors. The context vector for the word is multiplied 48 by the word's weight if weights were assigned 38 during preprocessing. This multiplication step 48, when used, produces weighted context vectors. The context vector or weighted context vector, as the case may be, is added 50 to the summary vector being formed for the document. For each feature in the vectors, the component value from the context vector of the word is added to the component value for the summary vector being formed. This results in a new summary vector for use with the next word in the document. After the context vectors for all of the remaining words to be considered in the document have been added, a gross summary vector 52 for the document is obtained.

Returning now to FIG. 3, if a summary vector is being determined for a plurality of documents in a query, the gross summary vector obtained from a summation process can be multiplied by a weight and added to the summary query vector being formed 54. Summary vector creation may then take place for the next document 56 being used to form the query. When all the
documents being used in the formation of the summary vectors have been processed, the gross summary vector is completed.

The gross summary vector from the summation process is normalized 58. Normalization is performed by dividing each component value in the vector by the absolute magnitude of the vector. The magnitude of the vector is determined by taking the square root of the sum of the squares of all of the component values in the vector. This results in a normalized summary vector. By providing normalized summary vectors, each document is given an equal weighting in a data base in which they are stored. The normalized summary vector is output 60 for storage. [illegible] normalized summary vector for each document in the data base.

[illegible] the storage of the normalized summary vectors can be arranged to further reduce searching time by creating cluster trees. Cluster tree formation is described in greater detail with respect to FIG. 6. An initial parent node at the top of the tree [illegible] summary vectors in the data base. A series of child nodes each branching from the initial parent node is [illegible] level of the cluster tree. A centroid [illegible] divide the [illegible] vectors in the group. One popular clustering algorithm is convergent k-means clustering (11). Convergent k-means clustering [illegible]:

1. [illegible] partition that groups the vectors [illegible]. For example, take the first k [illegible].

2. [illegible] in the cluster with the closest centroid, switch the vector to that cluster and update the centroids of the clusters which gain or lose a [illegible].

3. Repeat step 2 until convergence is achieved, that is, [illegible] through all of the summary vectors causes [illegible].

Since convergence may be rather time consuming to achieve, the clustering algorithm can be simplified by limiting the number of iterations through the algorithm. After say 99 iterations of the algorithm, the centroids can be frozen. Then one more pass can be made through all of the summary vectors distributing the vectors to appropriate clusters, but without updating the centroids. While, using this approximation, the centroids will no longer be exact centroids, the approximate centroids will be sufficient for the use of the present invention. It is not necessary to the present invention that the centroids be precise but rather that the clusters be centroid consistent. The last pass through the summary vectors guarantees that the clusters are centroid consistent [illegible]. From herein, "centroids" as used in this application shall mean approximate centroids. In other words, a [illegible] establish centroid consistent clusters. Each node is identified by its centroid for use in the searching process.

In forming a next level of clusters, the nodes in the level above become parent nodes to a set of child nodes below. Only the summary vectors assigned to a parent node are used in the clustering algorithm to form the child nodes which branch from that parent. This is repeated across the entire level of parent nodes and on subsequent levels so that fewer and fewer context vectors are assigned to the child nodes on each lower level. The nodes form a tree pattern in which each node branches from a node in the level above. [illegible] summary vector is assigned to a node on each level of the tree, each node [illegible] summary vector assigned to it. The nodes on the bottom level may be referred to as buckets.

Once a cluster tree has been set up, [illegible] matter to add a new document summary vector to the tree. The initial branches of the tree are examined for the closest centroid. The summary vector is assigned to the node with the closest centroid. Then the branches from that node are examined for the closest child node centroid, and the process is continued until a bucket [illegible]. The new [illegible] of the bucket [illegible] consistency of the clusters. If a bucket gets too big, the summary vectors on the bucket can be divided into sub[illegible]

[illegible] summary vectors in a data base, we now turn to [illegible] retrieval system of FIG. 7. An inquiry can be made using [illegible]. In order to treat a term comprised of several words with the same weight as a single key word, the context vectors of the words comprising the term are added together and then normalized to produce a single normalized context vector for the term. The query vector [illegible] with respect to FIG. 3. It is not necessary to normalize 58 the query vector.

If the [illegible] without the benefit of cluster trees 66, the query vector is compared with each summary vector in the data base in a brute force manner to identify the summary vector which is closest 68 to the query vector. The relative distance between a query vector and a summary vector can be determined by multiplying the query vector by a summary vector. Multiplication is performed by multiplying the component values for each feature together and summing the results. The result obtained can be compared with the magnitudes of the product vectors obtained with each of the summary vectors. The product vector with the maximum magnitude identifies the
closest summary vector to the query vector. Alternatively, the relative distance between the summary vectors and a query vector can be determined by subtracting the query vector from each of the summary vectors. The magnitude of the difference vectors may then be used to identify the closest summary vector to the query vector. However, in this case it is the difference vector with the minimum magnitude which is the closest summary vector.

By using the cluster tree storage mechanism of the present invention, the searching task can be greatly accelerated. Searching through a cluster tree 70 for the closest summary vector to a query vector is described with respect to FIGS. 8 and 9. The query vector is used in the subroutine of FIG. 9 to identify the summary vector that is closest to the query vector. The search is performed using a depth first tree walk. A branch is followed down the tree taking the node at each level having the centroid closest to the query vector. The search proceeds [illegible] (bucket) without children is reached 76. Each of the summary vectors in the bucket is compared with the query vector to identify the closest summary vector 78. The closest summary vector V is remembered and updated if during the search a closer summary vector is identified.

Before a subsequent node in the depth first tree walk is checked for a closest vector, first it is determined whether the node can be completely pruned. A node is pruned if it is not possible for a closer normalized summary vector to be assigned to the node than the closest normalized summary vector found so far without violating centroid consistency. Suppose we are examining a node with centroid C for pruning. If C' is the centroid of any sibling node then if it is true that any vector closer to the query vector Q than V (closest vector found so far) must be closer to C' than C, then we may prune the node with centroid C as well as any nodes branching therefrom. This may be computed by comparing 82 the distance between C and C' with twice the sum of the distance between C' and Q and the distance between Q and V. If the distance between C and C' is greater, then the node with centroid C (and descendants) may be pruned. If not, the formula is repeated for the remaining sibling nodes since any one of them may permit pruning to proceed. If none of the sibling nodes achieve pruning of the node, then the search continues through the node with centroid C and down into the subsequent level if there is one. By using the pruning formula 82, a node can be pruned when any vector closer to the query vector than the closest vector V must be closer to the centroid C' than to the centroid C. Therefore, that vector could not be assigned to node C or else it would violate centroid consistency. If this is a bottom node, then all of the summary vectors on the node must be checked 78 to determine whether any are closer than the closest vector found so far. If a closer summary vector is found, it will then become the closest summary vector 80 being remembered. Thus, bottom nodes are thoroughly searched if not pruned. The search continues in a depth first tree walk pruning off entire branches when possible. These searches continue through the tree until all branches have either been checked or pruned. After the entire tree has been searched, the closest summary vector has been identified. The document associated with the summary vector can be retrieved.

The pruning formula given above provides for rough pruning of the tree. Greater pruning can be accomplished if more work is put into the pruning algorithm. When the simple pruning algorithm fails it may be desirable to use linear programming to attempt to prune the path. This would require additional computational time but it may be worthwhile for pruning a high level branch.

For a linear programming approach, we seek to find out whether the following set of constraints has a feasible solution. Suppose we are considering node N for pruning and V is the closest vector found so far. We check whether any vector V* can exist that satisfies:

1. For each node [illegible] in the tree path from the initial parent node to N, it must be that V* is closer to the [illegible].

2. [illegible] between [illegible] and V is less than the [illegible].

[illegible] as a linear programming [illegible]. If [illegible] no solution), [illegible] descendants may be pruned.

As shown in FIG. 8, after the closest summary vector is found, it may be removed from consideration and the [illegible] repeated to find the next closest summary vector. This process may be repeated for as many summary vectors as are required.

Referring now to FIG. 10, the present invention is shown for use in achieving word sense disambiguation. The text surrounding an ambiguous word is input into the processing system. A summary vector is then created 92 for the text surrounding the ambiguous word. Summary vector creation was described with reference to FIG. 3. Weights may be assigned to each of the words in the series of words. One weighting mechanism would be to give the greatest weight to words which are closest to the ambiguous word in the text. Uninteresting words are removed from the series and the remaining words except for the ambiguous word are located in the dictionary of context vectors. The context vector for each of the remaining words is multiplied by its weight so as to produce a weighted context vector for each of the remaining words. For each of the remaining words being considered in the text surrounding the ambiguous word, the weighted context vectors are summed together. The sum of all of the weighted context vectors is the summary vector for the series of words. The normalization step is not necessary for word sense disambiguation.

The word being disambiguated is then considered 94. The dictionary of context vectors contains a different context vector for each of the different meanings which could be applied to the ambiguous word. The plurality of context vectors associated with the ambiguous word are retrieved from the dictionary of context vectors. The summary vector obtained from the surrounding text is then compared 96 with each of the context vectors associated with the ambiguous word. The relative distances between the summary vector and each of the context vectors can be determined by multiplying the vectors together or from subtracting the vectors from each other. The context vector which is determined to be closest to the summary vector of the surrounding text is identified as the appropriate meaning for the ambiguous word. If there are more than two possible meanings for the word, these can be ordered 98 according to their relative closeness to the summary vector for
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the surrounding text. The appropriate meaning can be 
output for the processing system. 
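
The word sense disambiguation steps described above can be illustrated with a minimal sketch. It assumes that "multiplying the vectors together" means a dot product and that a larger dot product means a closer vector. The two-feature vectors, the weights, the sense labels and the function names are all hypothetical values chosen for illustration; they do not come from the dictionary described here.

```python
def weighted_summary(context_vectors, weights):
    """Sum the context vectors after scaling each one by its weight."""
    dim = len(context_vectors[0])
    summary = [0.0] * dim
    for vec, weight in zip(context_vectors, weights):
        for i, value in enumerate(vec):
            summary[i] += weight * value
    return summary


def dot(a, b):
    """Dot product, used here as the closeness measure (an assumption)."""
    return sum(x * y for x, y in zip(a, b))


def rank_senses(summary, sense_vectors):
    """Order the candidate meanings from closest to farthest."""
    return sorted(sense_vectors,
                  key=lambda sense: dot(summary, sense_vectors[sense]),
                  reverse=True)


# Hypothetical vectors over two features: ("money", "water").
context = {"deposit": [4.0, 0.0], "loan": [5.0, 0.0]}
senses = {"bank (finance)": [5.0, -1.0], "bank (river)": [-1.0, 5.0]}

# Surrounding words "deposit" and "loan", the nearer word weighted higher.
summary = weighted_summary([context["deposit"], context["loan"]], [1.0, 0.5])
ranking = rank_senses(summary, senses)
```

No normalization of the summary vector is performed, since only the relative ordering of the candidate meanings is needed.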

The foundation for the workings for the present invention
is the dictionary of context vectors. At least part of the
data base needs to be entered by hand. For each of the
features making up all the features of the context vector
an integer should be entered according to how that feature
correlates, suggests and is consistent with the word stem
for which the context vector is being formed. For example,
a scale of from -5 to +5 may be used. It may be further
advantageous to normalize the context vectors in the
dictionary so that the average squared weight is the same
for each feature. Alternatively, normalization may be
performed for each word so that the average squared weight
is the same for each word in the dictionary. The creation
of context vectors will be a time consuming task but
fortunately it only needs to be performed once. Due to the
subtleties of human experience, it is preferred that at
least a core group of words have their context vectors
entered by humans.

An automated method for building a dictionary of context
vectors can be achieved with the aid of a training corpus
102, i.e., an initial set of documents, as shown in FIG. 11.
For each word stem, the number of documents in which the
word stem appears is counted 104. We let $F_w$ be the
fraction of training corpus documents in which the word
stem w appears. All of the word stems are then ordered 106
by their information content, which is defined by the
equation:

$$-F_w \log_2 F_w - (1 - F_w)\log_2(1 - F_w).$$

It is seen from this equation that words appearing in half
of the documents have the highest information content
while those appearing in either all or none of the
documents have the lowest content.
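
As a small illustration of the information content equation above, the following sketch computes the value for a given document fraction and orders a few stems by it. The function name, the example stems and the document counts are hypothetical, and the value at a fraction of exactly 0 or 1 is taken to be 0 (the usual limit convention, stated here as an assumption).

```python
import math


def information_content(doc_fraction):
    """Binary entropy, in bits, of the fraction of documents containing a stem."""
    f = doc_fraction
    if f <= 0.0 or f >= 1.0:
        # Limit convention: a stem in all or none of the documents scores 0.
        return 0.0
    return -f * math.log2(f) - (1.0 - f) * math.log2(1.0 - f)


# Hypothetical document counts out of a corpus of 8 documents.
total_docs = 8
doc_counts = {"the": 8, "neural": 4, "zygote": 1}

# Order stems from highest to lowest information content.
ordered = sorted(doc_counts,
                 key=lambda stem: information_content(doc_counts[stem] / total_docs),
                 reverse=True)
```

A stem found in half of the documents scores exactly 1 bit, the maximum.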

A core group of word stems having the highest information
content are taken from the top of the list. For example,
the first 1,000 word stems having the highest information
content may be selected. For this core group of word stems,
context vectors are generated by hand 108. Temporarily a 0
vector is assigned to any other word stems remaining 110.
The word stem w which has temporarily been assigned a 0
vector having the highest information content is then
taken. For this word stem, the context vectors of word
stems that are close to w in the training corpus documents
are weighted by their distance from w. For example, the 10
stems preceding and following each occurrence of the word
stem may be used. The weighted context vectors are added up
to produce a context vector for the word stem 112. The
context vector can then be normalized 114. The resulting
context vector becomes w's permanent context vector. The
next word stem w having the highest information content
from those word stems which have only a temporary 0 vector
is then selected and the process is repeated 116. It is
recommended that at least 1000 documents be used. Once the
dictionary of context vectors is completed, the invention
may be used to its full benefit. For such dictionary
building, multiple meanings of a word do not enter in; all
stems have only one context vector.
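
The dictionary building procedure just described can be sketched as follows. This is only an illustration under stated assumptions: the text says neighboring context vectors are "weighted by their distance from w" without giving a formula, so the inverse-distance weight used here is a guess, and the function names, the tiny corpus and the two-feature core vectors are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_dictionary(core, ordered_stems, documents, window=10):
    """core: stem -> hand-entered vector.
    ordered_stems: remaining stems, highest information content first.
    documents: each document is a list of word stems."""
    dim = len(next(iter(core.values())))
    vectors = {stem: list(vec) for stem, vec in core.items()}
    for stem in ordered_stems:
        vectors.setdefault(stem, [0.0] * dim)  # temporary zero vector
    for stem in ordered_stems:
        total = [0.0] * dim
        for doc in documents:
            for pos, token in enumerate(doc):
                if token != stem:
                    continue
                start = max(0, pos - window)
                stop = min(len(doc), pos + window + 1)
                for j in range(start, stop):
                    neighbor = vectors.get(doc[j])
                    if j == pos or neighbor is None:
                        continue
                    weight = 1.0 / abs(j - pos)  # assumed weighting scheme
                    for i, value in enumerate(neighbor):
                        total[i] += weight * value
        vectors[stem] = normalize(total)  # becomes the permanent vector
    return vectors


# Hypothetical core vectors over two features and a two-document corpus.
core = {"money": [1.0, 0.0], "river": [0.0, 1.0]}
documents = [["money", "loan", "money"], ["river", "shore"]]
vectors = build_dictionary(core, ["loan", "shore"], documents)
```

Because each stem's vector becomes permanent before the next stem is processed, stems handled later can draw on vectors generated earlier in the same pass.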

Of course, it should be understood that various changes and
modifications to the preferred embodiments described above
will be apparent to those skilled in the art. For example,
numerous weighting schemes, parsing algorithms, clustering
algorithms or methods for creating a context vector
dictionary are possible within the scope of the present
invention. These and other changes can be made without
departing from the spirit and scope of the invention and
without diminishing its attendant advantages. It is
therefore intended that such changes and modifications be
covered by the following claims.

I claim:

1. A method for storing a searchable representation of 
a document comprising the steps of: 
inputting a document containing a series of words in
machine readable form into a processing system;
removing from consideration any words in said series 
of words that are also found in a predetermined list 
of uninteresting words; 
locating in a dictionary of context vectors a context 
vector for each word remaining in said series of 
words, each context vector providing for each of a 
plurality of word-based features, a component 
value representative of a conceptual relationship 
between said word and said word-based feature; 
combining the context vectors for each word remain- 
ing in said series of words to obtain a summary 
vector for said document; 
normalizing said summary vector to produce a nor- 
malized summary vector; and 
storing said normalized summary vector.

2. The method of claim 1 wherein said step of combining
the context vectors comprises summing the context vectors.

3. The method of claim 1 further comprising assigning
weights to each of the words remaining in said series of
words and wherein said step of combining the context
vectors comprises multiplying the context vector for each
remaining word by the weight assigned to said word to
produce a weighted context vector for each remaining word
and summing the weighted context vectors for each remaining
word to obtain said summary vector for said document.

4. The method of claim 3 wherein words in a beginning
portion of said document are given more weight than other
words in said document.

5. The method of claim 3 further comprising parsing
sentences in said document to identify subjects and verbs
of a sentence and wherein said subjects and verbs are given
more weight than other words in said sentence.

6. The method of claim 1 further comprising the step 
of standardizing the words remaining in said series of 
words by replacing a plurality of said words each with 
a word stem corresponding to the word being replaced. 

7. The method of claim 1 wherein each context vector has
additional component values selected at random.

8. A method for generating a searchable representation of
a query comprising the steps of:
inputting a query comprising a plurality of query words or
texts, each text containing a series of words in machine
readable form, into a processing system;
assigning a weight to each query word or text;
for each query word or text, locating a context vector or
normalized summary vector, respectively, each of said
vectors providing for each of a plurality of word-based
features a component value representative of a conceptual
relationship between said query word or text and said
word-based feature;
multiplying the vector of each query word or text by the
weight assigned to said query word or text to produce a
weighted context vector for each query word and a weighted
summary vector for each text; and
summing the weighted context vectors and weighted summary
of said plurality of query words and texts to generate a
summary for said query.

9. The method of claim 8 wherein each of said vectors
includes additional component values determined through
random selection.

10. A method for cataloging searchable representations of
a plurality of documents comprising the steps of:
(a) generating a normalized summary vector for each
document of said plurality of documents to create a
corresponding plurality of normalized summary vectors;
(b) assigning each of said normalized summary vectors to
one of a plurality of nodes in accordance with a centroid
consistent clustering algorithm;
(c) calculating a centroid for each of said nodes;
(d) repeating steps (b) and (c) for the normalized summary
vectors on one or more of said nodes to create a tree of
nodes.

11. The method of claim 10 further comprising the step of
repeating said step (d) to add additional levels to said
tree of nodes.

12. The method of claim 10 wherein said step of generating
a normalized summary vector comprises:
inputting said each document containing a series of words
in machine readable form into a processing system;
removing from consideration any words in said series of
words that are also found in a predetermined list of
uninteresting words;
locating in a dictionary of context vectors a context
vector for each word remaining in said series of words,
each context vector providing for each of a plurality of
word-based features a component value representative of a
conceptual relationship between said word and said
word-based feature;
combining the context vectors for each word remaining in
said series of words to obtain a summary vector for said
each document; and
normalizing said summary vector to produce a normalized
summary vector corresponding to said each document.

13. A document cataloging and retrieval method 
comprising the steps of: 
inputting a plurality of documents in machine read- 
able form into a processing system; 
generating a normalized summary vector for each 
document of said plurality of documents to create a 
corresponding plurality of normalized summary 
vectors; 

assigning each of said normalized summary vectors in 
said plurality of normalized summary vectors to 
one of a plurality of nodes in accordance with a 
centroid consistent clustering algorithm; 
for a plurality of said nodes, assigning each of the 
normalized summary vectors assigned to said node 
to one of a plurality of nodes on a subsequent level 
in accordance with a centroid consistent clustering 
algorithm and repeating this step for nodes on sub- 
sequent levels to form a cluster tree of nodes, each 
node characterized by an approximate centroid; 
forming a query vector; 

searching said tree of nodes for a normalized sum- 
mary vector which is closest to said query vector 
by conducting a depth first tree walk through said 
tree of nodes and pruning a node and any nodes 
branching therefrom if, upon comparing the ap- 
proximate centroid of said node with said query 
vector and the approximate centroid of another 
node branching from the same node that said node 
branches from, it is not possible for a closer nor- 
malized summary vector to be on said node than 
the closest normalized summary vector found so 
far without violating centroid consistency; and 
retrieving the document corresponding to the nor- 
malized summary vector obtained after searching 
or pruning all nodes on said tree. 

14. A method for locating on a cluster tree the closest
vector to a query vector comprising the steps of:
providing a cluster tree for a plurality of vectors, said
tree having a parent node to which all of the vectors in
said plurality of vectors are assigned and subsequent
levels each with a plurality of nodes branching from a node
on a previous level, each node including a subset of the
vectors from the node it branches from characterized by an
approximate centroid wherein the vectors on a node of a
subsequent level are each closer to the approximate
centroid of its node than to the approximate centroid of
any other node on said subsequent level branching from the
same node;
forming a query vector;
searching said cluster tree of nodes for a normalized
summary vector which is closest to said query vector by
conducting a depth first tree walk taking the node
branching from a parent having the closest approximate
centroid of all the other nodes branching from the parent
and pruning a node and any nodes branching therefrom if it
is not possible for a closer normalized summary vector to
be on said node than the closest normalized summary vector
found so far without violating centroid consistency; and
identifying the closest normalized summary vector obtained
from said searching.
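
For illustration only, the pruning condition used during the tree walk can be sketched with the test given in the description: the distance between the centroids C and C' is compared with twice the sum of the distance from C to Q and the distance from Q to V. The sketch assumes Euclidean distance, and the function names and sample points are hypothetical; it is not part of the claim language.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors (an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def can_prune(c, c_prime, q, v):
    """True if the node with centroid c_prime cannot hold a vector closer
    to the query q than v, the closest vector found so far."""
    return distance(c, c_prime) > 2.0 * (distance(c, q) + distance(q, v))


# Hypothetical two-dimensional example.
c = [0.0, 0.0]   # centroid of a sibling node
q = [1.0, 0.0]   # query vector
v = [1.0, 1.0]   # closest vector found so far

far_sibling_pruned = can_prune(c, [10.0, 0.0], q, v)   # 10 > 2 * (1 + 1)
near_sibling_pruned = can_prune(c, [3.0, 0.0], q, v)   # 3 > 4 is false
```

When the test fails for every sibling centroid, the node must still be searched.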

15. The method of claim 14 further comprising the step of
adding a summary vector to said cluster tree by assigning
said summary vector to the parent node having an
approximate centroid closer to the summary vector than any
other parent node and then assigning said summary vector to
the node branching from said parent node on each subsequent
level that has an approximate centroid closer to the
summary vector than any other node on said subsequent level
and branching from the same node on a previous level to
which the vector has already been assigned.

16. A word sense disambiguation method comprising the
steps of:
inputting into a processing system in machine readable
form a series of words including and surrounding an
ambiguous word in a text;
removing from consideration any words in said series of
words that are also found in a predetermined list of
uninteresting words;
locating in a dictionary of context vectors a context
vector for each word remaining in said series of words;
combining the context vectors for each remaining word to
obtain a summary vector for said series of words;
locating a plurality of context vectors in said dictionary
of context vectors corresponding to a plurality of meanings
for said ambiguous word; and
combining said summary vector with each of said context
vectors associated with said ambiguous word to obtain a
relative distance between each of said context vectors and
said summary vector, said relative distances serving as a
measure of the relative appropriateness of each of said
meanings.

17. The word sense disambiguation method of claim 16
further comprising the steps of assigning a weight to each
word remaining in said series of words and multiplying the
context vector for each of said remaining words by the
weight assigned to said each word to produce a weighted
context vector for said each remaining word and wherein
said step of combining comprises summing the weighted
context vectors.

18. The word sense disambiguation method of claim 16
further comprising determining a most appropriate meaning
for said ambiguous word based upon said relative distances
and outputting said most appropriate meaning.

19. The word sense disambiguation method of claim 
16 further comprising outputting the meanings of said 
ambiguous word in order of the relative distances deter- 
mined for the corresponding context vectors. 

20. A method for generating a dictionary of context
vectors comprising:
providing a corpus of documents, each document including a
series of words;
creating a list of all of said words in said corpus of
documents;
inputting component values to generate context vectors for
a core group of words;
temporarily assigning a zero context vector to the words on
said list not included in said core group;
for each word with a zero vector in order of appearance on
said list, combining the context vectors for words
appearing close to said word within each of the series of
words in said documents to generate a context vector for
said word.

21. The method of claim 20 wherein said step of com- 
bining comprises assigning weights to the words in said 
series of words appearing close to said word based on 
relative closeness, multiplying the context vectors of 
said words in said series of words by said weights to 
form weighted context vectors and summing said 
weighted context vectors. 

22. The method of claim 21 further comprising nor- 
malizing said sum of weighted context vectors. 

23. The method of claim 20 further comprising the 
steps of: 

counting the number of documents each word is 
found in; 

using said count to determine an information content for
each word; and
selecting the core group of words from the words 
having the highest information content. 